An automated procedure to identify biomedical articles that contain cancer-associated gene variants.
Abstract
The proliferation of biomedical literature makes it increasingly difficult for researchers to find and manage relevant information. However, identifying research articles containing mutation data, a requisite first step in integrating large and complex mutation data sets, is currently tedious, time-consuming and imprecise. More effective mechanisms for identifying articles containing mutation information would be beneficial both for the curation of mutation databases and for individual researchers. We developed an automated method that uses information extraction, classifier, and relevance ranking techniques to determine the likelihood of MEDLINE abstracts containing information regarding genomic variation data suitable for inclusion in mutation databases. We targeted the CDKN2A (p16) gene and the procedure for document identification currently used by CDKN2A Database curators as a measure of feasibility. A set of abstracts was manually identified from a MEDLINE search as potentially containing specific CDKN2A mutation events. A subset of these abstracts was used as a training set for a maximum entropy classifier to identify text features distinguishing "relevant" from "not relevant" abstracts. Each document was represented as a set of indicative word, word pair, and entity tagger-derived genomic variation features. When applied to a test set of 200 candidate abstracts, the classifier predicted 88 articles as being relevant; of these, 29 of 32 manuscripts in which manual curation found CDKN2A sequence variants were positively predicted. Thus, the set of potentially useful articles that a manual curator would have to review was reduced by 56%, maintaining 91% recall (sensitivity) and more than doubling precision (positive predictive value). Subsequent expansion of the training set to 494 articles yielded similar precision and recall rates, and comparison of the original and expanded trials demonstrated that the average precision improved with the larger data set. 
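The figures reported for the 200-abstract test set can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and key names are hypothetical, and the "before" precision assumes the baseline is a curator reviewing all 200 candidates, of which 32 contain CDKN2A sequence variants.

```python
def screening_metrics(total, truly_relevant, predicted_relevant, true_positives):
    """Summarize how a classifier changes a curator's review workload.

    Assumption: without the classifier, the curator reads every candidate,
    so baseline precision is truly_relevant / total.
    """
    return {
        # Fraction of candidate abstracts the curator no longer has to read
        "workload_reduction": 1 - predicted_relevant / total,
        # Recall (sensitivity): relevant abstracts that were kept
        "recall": true_positives / truly_relevant,
        # Precision (positive predictive value) before and after filtering
        "precision_before": truly_relevant / total,
        "precision_after": true_positives / predicted_relevant,
    }


# Counts taken from the abstract: 200 candidates, 32 relevant,
# 88 predicted relevant, 29 of the 32 relevant ones among those 88.
m = screening_metrics(200, 32, 88, 29)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the review set shrinks by 56%, recall is about 91%, and precision rises from 16% to roughly 33%, which is consistent with the "more than doubling" stated above.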
Our results show that automated systems can effectively identify article subsets relevant to a given task and may prove to be powerful tools for the broader research community. This procedure can be readily adapted to any or all genes, organisms, or sets of documents.
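The document representation described in the abstract (indicative word and word-pair features fed to a maximum entropy classifier) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the hyperparameters, and the four toy abstracts are all hypothetical, the entity tagger-derived variation features are omitted, and a binary maximum entropy model is written here as logistic regression trained by stochastic gradient ascent.

```python
import math
import re
from collections import defaultdict


def features(text):
    """Return the set of word and adjacent word-pair features for one abstract."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    feats = set(words)
    feats.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return feats


def predict_proba(weights, feats):
    """Probability that an abstract is 'relevant' under the current weights."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train(docs, labels, epochs=200, lr=0.1, l2=0.01):
    """Fit a binary maximum entropy model with L2-regularized gradient ascent."""
    weights = defaultdict(float)
    data = [(features(d), y) for d, y in zip(docs, labels)]
    for _ in range(epochs):
        for feats, y in data:
            p = predict_proba(weights, feats)
            for f in feats:
                # Gradient of the log-likelihood for a binary feature, minus L2 penalty
                weights[f] += lr * ((y - p) - l2 * weights[f])
    return weights


# Toy, made-up training abstracts (1 = relevant, 0 = not relevant)
train_docs = [
    "germline CDKN2A mutation G101W identified in melanoma families",
    "novel missense variant detected by sequencing of exon 2",
    "review of cell cycle regulation by p16 protein",
    "p16 expression measured by immunohistochemistry in tumors",
]
train_labels = [1, 1, 0, 0]

model = train(train_docs, train_labels)
p_variant = predict_proba(model, features("missense mutation identified by sequencing"))
p_other = predict_proba(model, features("immunohistochemistry of p16 protein expression"))
print(p_variant > 0.5, p_other < 0.5)
```

Because the model only sees word and word-pair features, retargeting it to another gene would, in this sketch, amount to supplying a different labeled training set.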
Similar resources
Computational approach towards identification of pathogenic missense mutations in AMELX gene and their possible association with amelogenesis imperfecta
Amelogenin gene (AMEL-X) encodes an enamel protein called amelogenin, which plays a vital role in tooth development. Any mutations in this gene or the associated pathway lead to developmental abnormalities of the tooth. The present study aims to analyze functional missense mutations in AMEL-X genes and derive an association with amelogenesis imperfecta. The information on miss...
CTLA4 Gene Variants in Autoimmunity and Cancer: a Comparative Review
Gene association studies are less appealing in cancer compared to autoimmune diseases. Complexity, heterogeneity, variation in histological types, age at onset, short survival, and acute versus chronic conditions are cancer-related factors which are different from an organ-specific autoimmune disease, such as Graves' disease, on which a large body of multicentre data is accumulated. For years t...
Association of a New Germline Variant in the MUTYH DNA Glycosylase Gene with Colorectal Adenoma Transformation into Malignancy
Background: MUTYH DNA glycosylase germline mutations are linked to the recessive inheritance of multiple adenoma. Studies have revealed that germline mutations in this gene are ethnicity related. This study aimed to identify the germline mutations in MUTYH gene and determine their prevalence among Jordanian patients with colorectal adenoma. Methods: In this study, 150 colorectal adenoma patient...
Application of Information Technology: A Statistical Approach to Scanning the Biomedical Literature for Pharmacogenetics Knowledge
OBJECTIVE Biomedical databases summarize current scientific knowledge, but they generally require years of laborious curation effort to build, focusing on identifying pertinent literature and data in the voluminous biomedical literature. It is difficult to manually extract useful information embedded in the large volumes of literature, and automated intelligent text analysis tools are becoming ...
Functional investigation of the BRCA1 Val1714Gly and Asp1733Gly variants by computational tools and yeast transcription activation assay
Mutations in the BRCA1 gene are known to be a major cause of hereditary breast cancer. However, characterizing the point mutations associated with cancer in BRCA1 is challenging because the functional impact of most of them is still unknown. Nowadays, a variety of methods are employed to identify cancer-associated mutations in BRCA1. This study is aimed to ass...
Journal title: Human Mutation
Volume 27, Issue 9
Pages: -
Publication date: 2006